How can we add and interpret categorical terms in a linear regression model?
How can we add and interpret interaction terms in a linear regression model?
How can we diagnose and add non-linear terms in a linear regression model?
How can we use a regression model and counterfactual scenarios to predict the outcome variable?
Midterm Exam
Categorical Terms
Agenda
Categorical Terms
Interaction Terms
Non-linear Terms
Which Term to Add?
Midterm
If X is a binary predictor, we have shown that \beta_1 is the average effect of X on Y.
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
If our explanatory variable is a categorical variable with more than 2 categories, we need to “binarize” it into multiple binary variables, each of which denotes one category.
Optional: if the categorical variable is of numeric type in R, we need to transform it into a factor or character variable.
choose one category as the baseline/reference category
code each of the other categories as a binary variable, called a dummy variable.
ID   Group
1    Group 1
2    Group 2
3    Group 3
4    Group 4

\Rightarrow

ID   Group 2   Group 3   Group 4
1    0         0         0
2    1         0         0
3    0         1         0
4    0         0         1
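In R we rarely build these dummy columns by hand: lm() expands a factor variable automatically, using its first level as the reference category. A minimal sketch with a hypothetical data frame (dat and its group column are made up for illustration):

```r
# Hypothetical example: R expands a factor into dummy variables automatically
dat <- data.frame(group = factor(c("Group 1", "Group 2", "Group 3", "Group 4")))
model.matrix(~ group, data = dat)
# the first level ("Group 1") becomes the reference category;
# each remaining level gets its own 0/1 dummy column
```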
Interpretation for coefficients of categorical variables
\beta_0: group mean of reference category
\beta_j(j \neq \text{ref}): the average effect of category j compared to the reference category
lm(primary2006 ~ messages, data = social) # the intercept is the mean turnout of the reference message group, and each coefficient is the average difference between that message group and the reference group.
lm(primary2006 ~ -1 + messages, data = social) # if we drop the intercept, each coefficient can be directly interpreted as the mean turnout of the corresponding message group.
hypo_dat <- expand.grid(
  age = seq(from = 25, to = 85, by = 20),
  messages = c("Control", "Neighbors")
)
hypo_dat$primary2006_pred <- predict(res, newdata = hypo_dat)
hypo_dat
age messages primary2006_pred
1 25 Control 0.4896886
2 45 Control 0.4777178
3 65 Control 0.4657470
4 85 Control 0.4537763
5 25 Neighbors 0.4745173
6 45 Neighbors 0.4779976
7 65 Neighbors 0.4814778
8 85 Neighbors 0.4849581
plot(primary2006_pred ~ age, type = "l", col = "blue",
     data = subset(hypo_dat, messages == "Control"))
lines(primary2006_pred ~ age, type = "l", col = "red",
      data = subset(hypo_dat, messages == "Neighbors"))
legend("bottomright", title = "Messages", c("Control", "Neighbors"),
       lty = 1, col = c("blue", "red"))
Continuous + Continuous
res <- lm(primary2006 ~ age + hhsize + age:hhsize, data = social)
res
Call:
lm(formula = primary2006 ~ age + hhsize + age:hhsize, data = social)
Coefficients:
(Intercept) age hhsize age:hhsize
0.5830103 -0.0014437 -0.0368314 0.0004591
plot(primary2006_pred ~ age, type = "l", col = "blue",
     data = subset(hypo_dat, hhsize == 2))
lines(primary2006_pred ~ age, type = "l", col = "green",
      data = subset(hypo_dat, hhsize == 3))
lines(primary2006_pred ~ age, type = "l", col = "red",
      data = subset(hypo_dat, hhsize == 4))
legend("bottomright", title = "Household Size", c("2", "3", "4"),
       lty = 1, col = c("blue", "green", "red"))
plot(primary2006_pred ~ hhsize, type = "l", col = "blue",
     data = subset(hypo_dat, age == 25), ylim = c(0, 0.5))
lines(primary2006_pred ~ hhsize, type = "l", col = "green",
      data = subset(hypo_dat, age == 45))
lines(primary2006_pred ~ hhsize, type = "l", col = "red",
      data = subset(hypo_dat, age == 65))
legend("bottomright", title = "Age", c("25", "45", "65"),
       lty = 1, col = c("blue", "green", "red"))
Sometimes the value zero is not realistic for a continuous covariate. Such an unrealistic "zero" scenario makes the intercept and lower-order coefficients hard to interpret. To ease interpretation, we can center our covariates, that is, subtract the sample mean from each covariate so that zero corresponds to the average value.
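For example, one common way to center a covariate is to subtract its sample mean. A sketch, assuming the age and hhsize variables from the social data used above:

```r
# Center age so that age_c = 0 corresponds to a voter of average age;
# the intercept and any interaction terms then refer to this realistic scenario
social$age_c <- social$age - mean(social$age)
lm(primary2006 ~ age_c * hhsize, data = social)
```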
Download and load the health2.csv data from Canvas.
Subset the data set with the condition grepl("2017", date). That is, we only consider records in 2017.
Regress weight on dayofyear, steps.lag, calorie.lag. We call it a restricted model.
Plot residuals against each explanatory variable. Which non-linear term do we need to add?
Rerun the regression model, but this time we add a quadratic term of dayofyear, an interaction term between dayofyear and calorie.lag, and an interaction term between I(dayofyear^2) and calorie.lag. We call it an unrestricted model.
(Optional) Perform an F-test between restricted model and unrestricted model using anova().
Use the following code to create several counterfactual scenarios
Use the unrestricted model and counterfactual scenarios to predict weights.
Plot predicted values (y-axis) against dayofyear (x-axis) when calorie.lag equals the first quartile, the median, and the third quartile, respectively.
# 1. load the data
health <- read.csv("./data/health2.csv")
# 2. subset the data to 2017
health <- subset(health, grepl("2017", date))
# 3. run regression
res <- lm(weight ~ dayofyear + steps.lag + calorie.lag, data = health)
# 4. plot residuals against each explanatory variable
par(mfrow = c(1, 3))
plot(x = health$dayofyear, y = residuals(res))
plot(x = health$steps.lag, y = residuals(res))
plot(x = health$calorie.lag, y = residuals(res))
# 5. re-run regression with quadratic and interaction terms
quad_res <- lm(weight ~ (dayofyear + I(dayofyear^2)) * calorie.lag + steps.lag,
               data = health)
# 6. F-test
anova(res, quad_res)
Analysis of Variance Table
Model 1: weight ~ dayofyear + steps.lag + calorie.lag
Model 2: weight ~ (dayofyear + I(dayofyear^2)) * calorie.lag + steps.lag
Res.Df RSS Df Sum of Sq F Pr(>F)
1 210 3870.4
2 207 1158.1 3 2712.2 161.59 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 7. create counterfactual scenarios
hypo_dat <- expand.grid(
  dayofyear = 1:365,
  calorie.lag = quantile(health$calorie.lag, c(0.25, 0.5, 0.75)),
  steps.lag = mean(health$steps.lag)
)
# 8. use the unrestricted model and hypothetical scenarios to predict weights
hypo_dat$weight_pred <- predict(quad_res, hypo_dat)
# 9. plot predicted values against dayofyear when calorie.lag equals the
#    first quartile, median, and third quartile, respectively
plot(weight_pred ~ dayofyear, type = "l", col = "blue",
     data = subset(hypo_dat, calorie.lag == quantile(health$calorie.lag, 0.25)))
lines(weight_pred ~ dayofyear, type = "l", col = "green",
      data = subset(hypo_dat, calorie.lag == quantile(health$calorie.lag, 0.5)))
lines(weight_pred ~ dayofyear, type = "l", col = "red",
      data = subset(hypo_dat, calorie.lag == quantile(health$calorie.lag, 0.75)))
legend("bottomright", title = "Active Calorie",
       c("1st Quartile", "Median", "3rd Quartile"),
       lty = 1, col = c("blue", "green", "red"))
Which Term to Add?
Diagnostics Based on Prediction and Precision
Scatter plot between residuals \hat{\varepsilon} and explanatory variable X
check whether there is an unexplained pattern in the data.
R^2 and F test (Not required currently)
If adding an additional term increases predictive power, the sum of squared residuals in the new (unrestricted) model should be smaller than that of the old (restricted) model.
\text{F-statistic} = \frac{\text{(scaled) reduction in prediction error}}{\text{(scaled) remaining prediction error}} = \frac{(\text{RSS}_\text{restricted} - \text{RSS}_\text{unrestricted}) / q}{\text{RSS}_\text{unrestricted} / (n - k - 1)} \sim F_{q, n - k - 1}
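This F-statistic can also be computed by hand from the two fitted models. A sketch, assuming the res (restricted) and quad_res (unrestricted) models from the exercise above:

```r
rss_r <- sum(residuals(res)^2)       # RSS of the restricted model
rss_u <- sum(residuals(quad_res)^2)  # RSS of the unrestricted model
q <- length(coef(quad_res)) - length(coef(res))  # number of added terms
df_u <- df.residual(quad_res)                    # n - k - 1
F_stat <- ((rss_r - rss_u) / q) / (rss_u / df_u)
pf(F_stat, q, df_u, lower.tail = FALSE)  # p-value, matching anova(res, quad_res)
```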
Add Terms Based on Causality
Good Control: Confounders
D \leftarrow X \rightarrow Y
A confounder is the common cause for both D and Y, e.g. whether a student is a senior can both affect whether she uses ChatGPT and the test score.
set.seed(98105)
X <- rnorm(1e3)
D <- 2 * X + rnorm(1e3)
Y <- 5 * D + 10 * X + rnorm(1e3)
lm(Y ~ D) |> summary() # biased estimate of the direct effect
Call:
lm(formula = Y ~ D)
Residuals:
Min 1Q Median 3Q Max
-13.771 -3.156 0.033 3.153 16.398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.11747 0.14848 -0.791 0.429
D 9.15366 0.06697 136.676 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.694 on 998 degrees of freedom
Multiple R-squared: 0.9493, Adjusted R-squared: 0.9492
F-statistic: 1.868e+04 on 1 and 998 DF, p-value: < 2.2e-16
lm(Y ~ D + X) |>summary() # including the confounder X will correct the estimate
Call:
lm(formula = Y ~ D + X)
Residuals:
Min 1Q Median 3Q Max
-2.87279 -0.65034 0.03951 0.65597 2.84164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02544 0.03100 0.821 0.412
D 4.97982 0.03147 158.264 <2e-16 ***
X 10.09999 0.06822 148.056 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9796 on 997 degrees of freedom
Multiple R-squared: 0.9978, Adjusted R-squared: 0.9978
F-statistic: 2.254e+05 on 2 and 997 DF, p-value: < 2.2e-16
Bad Control: Post-treatment variables
D \rightarrow X \rightarrow Y
A post-treatment variable is a mediator between D and Y, e.g. a student’s study time can be affected by whether the student uses ChatGPT, and her study time will then influence the test score.
set.seed(98105)
D <- rnorm(1e3)
X <- 5 * D + rnorm(1e3)
Y <- 10 * X + rnorm(1e3)
lm(Y ~ D) |> summary() # this will give us a roughly correct estimate of the total effect
Call:
lm(formula = Y ~ D)
Residuals:
Min 1Q Median 3Q Max
-38.749 -6.495 -0.012 6.507 34.479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3957 0.3125 1.266 0.206
D 49.4850 0.3057 161.866 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.883 on 998 degrees of freedom
Multiple R-squared: 0.9633, Adjusted R-squared: 0.9633
F-statistic: 2.62e+04 on 1 and 998 DF, p-value: < 2.2e-16
lm(Y ~ D + X) |>summary() # including the post-treatment variable X will mask the relationship between D and Y
Call:
lm(formula = Y ~ D + X)
Residuals:
Min 1Q Median 3Q Max
-2.87279 -0.65034 0.03951 0.65597 2.84164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02544 0.03100 0.821 0.412
D 0.16053 0.15844 1.013 0.311
X 9.97982 0.03147 317.170 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9796 on 997 degrees of freedom
Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
F-statistic: 1.384e+06 on 2 and 997 DF, p-value: < 2.2e-16
Bad Control: Colliders
Y \rightarrow X \leftarrow D
A collider is the common consequence of both D and Y, e.g. a student’s evaluation on TA is influenced by both whether the student uses ChatGPT and whether the student gets good test scores.
set.seed(98105)
D <- rnorm(1e3)
Y <- rnorm(1e3)             # Y is drawn independently of D
X <- 5 * D + Y + rnorm(1e3) # the collider X depends on both D and Y
lm(Y ~ D) |> summary() # D and Y have no correlation
(Output omitted: with D and Y drawn independently, the coefficient on D is close to zero and statistically insignificant.)
lm(Y ~ D + X) |>summary() # including the collider X will create information flows between D and Y
(Output omitted: conditioning on the collider X induces a spurious negative coefficient on D, even though D and Y are independent.)
Shutting the Backdoor
Include all pre-treatment confounders in the regression model; do not control for post-treatment variables or colliders.